Enable exporting monthly invoice to Iceberg #292
|
Aren't the dataframes that we'll be exporting just the processed billable and non-billable dataframes? |
|
Is Billable, Cluster Assignment Source, and Cluster Assignment Method are three new columns that were created by the backfilling process. Going forward, we can just set Cluster Assignment Source and Cluster Assignment Method to something like "Regular Monthly Invoicing Cycle". There is already an |
I am hesitant to include those columns in the Iceberg table. I believe they should just be removed, since they're not useful for anything in invoicing. We had those columns in the first place because they helped in constructing the initial Iceberg table. As for which invoice to export, I'm fine with exporting the entire "main" dataframe, since that's equivalent to billable + nonbillable, and it is also what's currently in the initial Iceberg table. |
|
We shouldn't have
|
During 2026-03 invoicing, a bug was found where the columns initialized by the New-PI credit processor (i.e. the `PI Balance` column) were being accessed by the PI-SU processor before they were initialized, causing a KeyError.

To fix this, the codebase has been refactored so that each processor explicitly documents which columns it initializes and uses, via two new properties, `initializes_columns` and `operates_on_columns`. A helper function `_init_columns()` is added to initialize columns. A unit test, `tests/unit/processors/test_processor_list.py`, is added to check that each processor only uses columns that it or a previous processor initialized, and that no column is initialized more than once.

Additionally, each column is now encapsulated as an `InvoiceColumn` instance. `InvoiceColumn` contains the name, datatype, and default value for each column. This also enables stricter and clearer type enforcement for data entering and leaving the pipeline.

A new processor, `ValidateInputColumnsProcessor`, is added to check that the input dataframe to the processing pipeline has the prerequisite columns, and to cast them to the appropriate types.

The e2e test data has been updated to surface the bug that was found. It did not fail during the PR that introduced the bug [1] because the test data didn't have the right conditions to trigger the PI-SU processor.

Refactored unit tests to accommodate the new processor by adding a new base test class.

[1] CCI-MOC#279
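For readers following along, the ordering check described in this commit message could look roughly like the sketch below. It is not the code from this PR: `Processor`, `check_processor_order`, the `COST` column, and the field names on `InvoiceColumn` are assumptions made for illustration. Only the names `InvoiceColumn`, `initializes_columns`, `operates_on_columns`, and the `PI Balance` column come from the description above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class InvoiceColumn:
    """Name, datatype, and default value for one invoice column (assumed layout)."""
    name: str
    dtype: type
    default: Any = None


PI_BALANCE = InvoiceColumn("PI Balance", float, 0.0)
COST = InvoiceColumn("Cost", float, 0.0)  # hypothetical input column


class Processor:
    """Hypothetical base class: each processor declares its columns."""
    initializes_columns = ()
    operates_on_columns = ()


class NewPICreditProcessor(Processor):
    initializes_columns = (PI_BALANCE,)
    operates_on_columns = (COST,)


class PISUProcessor(Processor):
    operates_on_columns = (PI_BALANCE,)


def check_processor_order(processors, input_columns):
    """Return a list of ordering errors; an empty list means the order is valid."""
    available = {col.name for col in input_columns}
    errors = []
    for proc in processors:
        proc_name = type(proc).__name__
        # A processor may use columns it initializes itself, so register those first.
        for col in proc.initializes_columns:
            if col.name in available:
                errors.append(f"{proc_name} re-initializes {col.name!r}")
            available.add(col.name)
        for col in proc.operates_on_columns:
            if col.name not in available:
                errors.append(
                    f"{proc_name} uses {col.name!r} before it is initialized"
                )
    return errors
```

Running the check with the PI-SU processor ahead of the New-PI credit processor reports the same kind of ordering problem that caused the KeyError, while the reverse order passes.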
ad4efbf to bfe5338
|
@knikolla @naved001 @jimmysway I now consider this PR ready for review. I have rebased it on top of the column tracking PR, as the current state of the e2e test won't allow the invoice to be exported to Iceberg easily. Specifically, Iceberg does not allow exporting of columns with a |
Added new invoice `IcebergInvoice` to export invoice data to Iceberg tables. The export process also includes a schema update step to allow updates to the Iceberg table schema.

New Iceberg integration test added to validate Iceberg functionality. E2E test updated to include Iceberg exporting. Both tests use a temporary SQLite catalog.
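To make the schema update step concrete, here is a minimal, library-free sketch of the decision it has to make before appending data: which columns are new, and whether any existing column changed type. It does not use the PR's `IcebergInvoice` code or any Iceberg client API. The function names, the column names other than `PI Balance`, and the use of plain string type names are all assumptions for illustration. It is also deliberately stricter than Iceberg itself, which permits a few type promotions such as int to long and float to double.

```python
def plan_schema_update(table_schema, frame_schema):
    """Return the columns that must be added to the table before an append.

    Both arguments map column name -> type name (plain strings in this sketch).
    Raises TypeError if an existing column would change type.
    """
    to_add = {}
    for name, dtype in frame_schema.items():
        if name not in table_schema:
            to_add[name] = dtype
        elif table_schema[name] != dtype:
            raise TypeError(
                f"column {name!r}: table has {table_schema[name]}, data has {dtype}"
            )
    return to_add


def apply_schema_update(table_schema, to_add):
    """Return a new schema with the planned columns appended after existing ones."""
    return {**table_schema, **to_add}


# Hypothetical schemas for illustration only.
existing = {"Invoice Month": "string", "Cost": "double"}
incoming = {"Invoice Month": "string", "Cost": "double", "PI Balance": "double"}

new_columns = plan_schema_update(existing, incoming)
updated = apply_schema_update(existing, new_columns)
```

Columns that exist in the table but are missing from the incoming data are left untouched here; whether the export should fill them with nulls or reject the data is a separate decision that this sketch does not make.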
bfe5338 to bfbb760
|
@knikolla @naved001 @jimmysway I just learned that |
Closes #259. This PR is in draft, since I'm not yet clear on which dataframe we want to export. Also, I want consensus on the datatypes of the columns we want in the Iceberg table. It seems this PR will depend on #285.
@knikolla @naved001 Regardless, for now, this draft should be enough to show you guys what the Iceberg operations involved will look like, and the structure of the code changes that will need to happen.